Global parameters for modeling

Exploratory Plots

Some exploratory plots of un-modeled gene counts.

Before you do any statistical analysis, it’s important to know what your data looks like. This is important for two main reasons:

We’re going to look at the raw (and mostly raw) gene count data in a few ways.

Heatmaps

In all the following heatmaps, samples or genes are clustered by Euclidean distance of rlog-transformed gene counts (this removed the dependence of variance on mean, and flattens out the variance across the data set, which makes the heatmaps a little easier to read). Transformed counts were not used for the DESeq2 analysis.

Individuals clustered by overall expression

This is a heatmap clustering individuals by their expression levels for all genes. Essentially, this is a correlation plot. Each individual is perfectly correlated with itself (the dark blue diagonal). This plot shows two big groups. One that is all cannibals (on the left), one that is all non-cannibals (on the right). Families tend to cluster within feeding strategies/groups.

Individuals by Top 500 genes heatmap

Here we’re using the same transformed counts to look at overall expression, however I’ve filtered out all but the top 500 most highly expressed genes (on average, across all individuals). Here, I’m only clustering by individuals on the x-axis, where the x-axis is individuals, but the y-axis is transcripts. Here, we can see that in the 500 most expressed genes, there are very few that have highly varying expression across samples. However, in these 500 genes, we can better cluster non-cannibals vs cannibals. There is still pretty good clustering of families within feeding strategy.

Individuals clustered by top 500 gene expression

This plot is analogous to the first heatmap, in that I’m now showing a correlation matrix, however this is using just the top 500 most expressed genes rather than all of them. Clustering with the top 500 genes is giving us three big expression groups: cannibals, non-cannibals and a mix.

PCA for overall expression

Here, instead of looking at correlations, we’re doing a PCA. This takes the entire (transformed) count table and rotates it to find the set of orthagonal axes with the most variation. Here I’m always plotting PC1 vs PC2 (the most variable and second most variable axis). I’m coloring points by either family, feeding strategy, or both. Since PCA looks for the axis in the data with the most variation, what this will do is push dis-similar samples away from each other in the plot. In effect, we can see whether there are already clusters in the data from variables that we’re not interested in and want to model out.

In these plots, for instance, you can see on the top left that the Mc samples are pretty different from all the others, and that family effect might be part of why we’re getting three clusters in the last heatmap. If we don’t account for family in the DESeq model, we’ll probably get the wrong genes.

In the top right plot, you can see that non-cannibals and cannibals actually are pretty easily distinguishable just by overall gene expression, which is good. Once we account for family, we should have a fairly reliable gene list.

However, if there are other meta-data factors that differ between samples we should add them into this plot and check for those before going forward. This would be things like batch effects, lane effects, sample extraction differences, etc.

---
title: "DEanalysis_kfish_osmotic"
author: "Lisa Johnson"
date: '`r Sys.Date()`'
output:
  html_document:
    code_folding: hide
    collapsed: no
    df_print: paged
    number_sections: yes
    theme: cerulean
    toc: yes
    toc_depth: 5
    toc_float: yes
  html_notebook:
    toc: yes
    toc_depth: 5
---

```{r GlobalVariables}

#Setting a reasonable p-value threshold to be used throughout

p_cutoff <- 0.05

p_cutoff_new <- 0.05

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

FC_cutoff_original <- 0

FC_cutoff_new <- 1


```


#Global parameters for modeling



```{r LoadPackages, results='hide', include=FALSE}

# Install function for packages    
packages<-function(x){
  x<-as.character(match.call()[[2]])
  if (!require(x,character.only=TRUE)){
    install.packages(pkgs=x,repos="http://cran.r-project.org")
    require(x,character.only=TRUE)
  }
}

bioconductors <- function(x){
    x<- as.character(match.call()[[2]])
    if (!require(x, character.only = TRUE)){
      source("https://bioconductor.org/biocLite.R")
      biocLite(pkgs=x)
      require(x, character.only = TRUE)
    }
}

packages(MASS)
packages(ggplot2)
packages(gtools)
packages(pheatmap)
packages(cowplot)
packages(RColorBrewer)
packages(dplyr)
packages(tidyr)
packages(ggrepel)
bioconductors(DESeq2)
bioconductors(limma)
bioconductors('edgeR')
packages(gplots)
packages(lattice)

sessionInfo()

```


```{r loadfiles, results='hide', include=FALSE}

# This is the counts with Experimental Design Info in the last 5 rows

if(!file.exists('../../../Ensembl_species_counts_designfactors.csv')){
  download.file("https://osf.io/7vp38/download",'Ensembl_species_counts_designfactors.csv')
}
if(!file.exists('../../../nonzero_clade_physiology_counts_design.csv')){
  download.file("https://osf.io/be7ny/download",'nonzero_clade_physiology_counts_design.csv')
}
if(!file.exists('../../../greater5counts_clade_physiology_counts_design.csv')){
  download.file("https://osf.io/be7ny/download",'greater5counts_clade_physiology_counts_design.csv')
}

#counts_design <- read.csv("Ensembl_species_counts_designfactors.csv",stringsAsFactors = TRUE)
counts_design <- read.csv("../../../nonzero_clade_physiology_counts_design.csv",stringsAsFactors = TRUE)


```


```{r designinfo, include=FALSE}
# -----------------------
# Format design and counts matrix
# -----------------------

design <- counts_design[counts_design$Ensembl == 'Empty',]
#design$type <- c("species","native_salinity","clade","group","condition")
drops <- c("X","Ensembl",
           "F_zebrinus_BW_1.quant","F_zebrinus_BW_2.quant",
           "F_zebrinus_FW_1.quant","F_zebrinus_FW_2.quant",
           "F_notti_FW_1.quant","F_notti_FW_2.quant",
           "F_sciadicus_BW_1.quant","F_sciadicus_FW_1.quant","F_sciadicus_FW_2.quant")
transfer_drops <- c("F_sciadicus_transfer_1.quant","F_rathbuni_transfer_1.quant","F_rathbuni_transfer_2.quant","F_rathbuni_transfer_3.quant",
                    "F_grandis_transfer_1.quant","F_grandis_transfer_2.quant","F_grandis_transfer_3.quant",
                    "F_notatus_transfer_1.quant","F_notatus_transfer_2.quant","F_notatus_transfer_3.quant",
                    "F_parvapinis_transfer_1.quant","F_parvapinis_transfer_2.quant",
                    "L_goodei_transfer_1.quant","L_goodei_transfer_2.quant","L_goodei_transfer_3.quant",
                    "F_olivaceous_transfer_1.quant","F_olivaceous_transfer_2.quant",
                    "L_parva_transfer_1.quant","L_parva_transfer_2.quant","L_parva_transfer_3.quant",
                    "F_heteroclitusMDPP_transfer_1.quant","F_heteroclitusMDPP_transfer_2.quant","F_heteroclitusMDPP_transfer_3.quant",
                    "F_similis_transfer_1.quant","F_similis_transfer_2.quant","F_similis_transfer_3.quant",
                    "F_diaphanus_transfer_1.quant","F_diaphanus_transfer_2.quant",
                    "F_chrysotus_transfer_1.quant","F_chrysotus_transfer_2.quant",
                    "A_xenica_transfer_1.quant","A_xenica_transfer_2.quant","A_xenica_transfer_3.quant" ,
                    "F_catanatus_transfer_1.quant","F_catanatus_transfer_2.quant",
                    "F_heteroclitusMDPL_transfer_1.quant","F_heteroclitusMDPL_transfer_2.quant","F_heteroclitusMDPL_transfer_3.quant")
counts<-counts_design[!counts_design$Ensembl == 'Empty',]
rownames(counts)<-counts$Ensembl
design <- design[ , !(names(design) %in% drops)]
counts <- counts[ , !(names(counts) %in% drops)]
design <- design[ , !(names(design) %in% transfer_drops)]
counts <- counts[ , !(names(counts) %in% transfer_drops)]
dim(design)
#[1]  5 81
dim(counts)
#[1] 30466    81
# --------------------
# design cateogories
# --------------------

species<-as.character(unlist(design[1,]))
physiology<-as.character(unlist(design[2,]))
clade<-as.character(unlist(design[3,]))
condition<-as.character(unlist(design[5,]))
condition_physiology<-as.vector(paste(condition,physiology,sep="."))
cols<-colnames(counts)
ExpDesign <- data.frame(row.names=cols,
                        condition=condition,
                        physiology = physiology,
                        clade = clade,
                        species = species,
                        sample=cols)
ExpDesign
design = model.matrix( ~ physiology + condition + physiology:condition, ExpDesign)

colnames(design)
# check rank of matrix
Matrix::rankMatrix( design )
dim(design)
```


```{r norm, results='hide', include=FALSE}
counts_round <- round(data.matrix(counts), digits=0)
#counts_round <- head(counts_round, n = 19000)
dim(counts_round)
#plot(colSums(t(counts_round)) )
lcpm2 <- cpm(counts_round, log = TRUE)
#boxplot(lcpm2, las = 2, main = "Before DESeq2 Normalization")

# transform with DESeq to stabilize variance

dds <- DESeqDataSetFromMatrix(countData = counts_round,colData = ExpDesign,design = design)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds$sizeFactor
nc <- counts(dds, normalized=TRUE)
#write.csv(nc, "../../../normcounts.csv")
counts<-nc
#plot(colSums(t(counts)) )

cpm <- cpm(counts)
lcpm <- cpm(counts, log = TRUE)
#plot(colSums(t(lcpm)) )
#boxplot(lcpm, las = 2, main = "After DESeq2 Normalization")
```


# Exploratory Plots

Some exploratory plots of un-modeled gene counts.

Before you do any statistical analysis, it's important to know what your data looks like. This is important for two main reasons:

  - It can help you find errors that will effect your downstream analysis
  - It can help you to choose an appropriate analysis for your data

We're going to look at the raw (and mostly raw) gene count data in a few ways.

## Heatmaps


In all the following heatmaps, samples or genes are clustered by Euclidean distance of rlog-transformed gene counts (this removed the dependence of variance on mean, and flattens out the variance across the data set, which makes the heatmaps a little easier to read). Transformed counts were *not* used for the DESeq2 analysis.

### Individuals clustered by overall expression

This is a heatmap clustering individuals by their expression levels for all genes. Essentially, this is a correlation plot. Each individual is perfectly correlated with itself (the dark blue diagonal). This plot shows two big groups. One that is all cannibals (on the left), one that is all non-cannibals (on the right). Families tend to cluster within feeding strategies/groups.

```{r PlainHeatmap, fig.keep="last", fig.width=11, fig.path='figures/', dev=c('png', 'pdf')}
#A useful first step in an RNA-seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment’s design? We use the R function dist to calculate the Euclidean distance between samples. To ensure we have a roughly equal contribution from all genes, we use it on the rlog-transformed data. We need to transpose the matrix of values using t, because the dist function expects the different samples to be rows of its argument, and different dimensions (here, genes) to be columns. 
#**Note that the two transformations offered by DESeq2 are provided for applications other than differential testing.** For differential testing we recommend the DESeq function applied to raw counts, as described later in this workflow, which also takes into account the dependence of the variance of counts on the mean value during the dispersion estimation step. The function rlog returns an object based on the SummarizedExperiment class that contains the rlog-transformed values in its assay slot. http://www.bioconductor.org/help/workflows/rnaseqGene/#the-deseqdataset-object-sample-information-and-the-design-formula


rld <- vst(dds, blind = FALSE,fitType='local')
sampleDists <- dist(t(assay(rld)))
df <- as.data.frame(colData(dds)[,c("physiology","condition","clade")])
sampleDistMatrix <- as.matrix( sampleDists )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors, annotation = df, show_rownames=F)

```

### Individuals by Top 500 genes heatmap

Here we're using the same transformed counts to look at overall expression, however I've filtered out all but the top 500 most highly expressed genes (on average, across all individuals). Here, I'm only clustering by individuals on the x-axis, where the x-axis is individuals, but the y-axis is transcripts. Here, we can see that in the 500 most expressed genes, there are very few that have highly varying expression across samples. However, in these 500 genes, we can better cluster non-cannibals vs cannibals. There is still pretty good clustering of families within feeding strategy. 


```{r MiniPlainGeneHeatmap, echo=FALSE, fig.keep="last", fig.width=11, fig.path='figures/', dev=c('png', 'pdf')}

select100 <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100]

pheatmap(assay(rld)[select100,], show_rownames=T,clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists, annotation_col=df)


```

### Individuals clustered by top 500 gene expression

This plot is analogous to the first heatmap, in that I'm now showing a correlation matrix, however this is using just the top 500 most expressed genes rather than all of them. Clustering with the top 500 genes is giving us three big expression groups: cannibals, non-cannibals and a mix.

```{r MiniPlainHeatmap, echo=FALSE, fig.keep="last", fig.width=11, fig.path='figures/', dev=c('png', 'pdf')}

select100 <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100]

sampleDists <- dist(t(assay(rld)[select100,]))
sampleDistMatrix <- as.matrix( sampleDists )

pheatmap(sampleDistMatrix, show_rownames=T,clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists, annotation_col=df)


```

### PCA for overall expression

Here, instead of looking at correlations, we're doing a PCA. This takes the entire (transformed) count table and rotates it to find the set of orthagonal axes with the most variation. Here I'm always plotting PC1 vs PC2 (the most variable and second most variable axis). I'm coloring points by either family, feeding strategy, or both. Since PCA looks for the axis in the data with the most variation, what this will do is push dis-similar samples away from each other in the plot. In effect, we can see whether there are already clusters in the data from variables that we're *not* interested in and want to model *out*. 

In these plots, for instance, you can see on the top left that the Mc samples are pretty different from all the others, and that family effect might be part of why we're getting three clusters in the last heatmap. If we don't account for family in the DESeq model, we'll probably get the wrong genes. 

In the top right plot, you can see that non-cannibals and cannibals actually are pretty easily distinguishable just by overall gene expression, which is good. Once we account for family, we should have a fairly reliable gene list. 

However, if there are other meta-data factors that differ between samples we should add them into this plot and check for those *before* going forward. This would be things like batch effects, lane effects, sample extraction differences, etc.


```{r plainPCA, fig.keep="last", fig.width=11, fig.path='figures/', dev=c('png', 'pdf')}


cowplot::plot_grid( plotPCA(rld, intgroup="condition"),
                    plotPCA(rld, intgroup="physiology"),
                    plotPCA(rld, intgroup="clade"),
                    plotPCA(rld, intgroup=c("clade","physiology")),
                           align="c", ncol=2)

```

